What is Reddit?



Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, and images, which are then voted up or down by other members.

What is a Bot?



Bots are very common in the Reddit community. They are created by users to respond to comments, and they can perform different kinds of tasks such as gathering information, creating memes, or simply leaving a silly message for other users.


How does it work?



Parent Comment is the top-level comment that contains other comments; users usually summon the bots at this level.

Bot Comment is the second-level comment, triggered when another user calls the bot.

Child Comment is the third-level comment; users use this level to reply to the bot comments.
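The three levels above can be sketched as a small nested structure (the texts here are invented for illustration):

```python
# Hypothetical three-level thread; the texts are made up for illustration.
thread = {
    "parent": "BOBBY B BOT, what do you say?",  # top-level comment summoning the bot
    "bot": "Gods, I was strong then!",          # the bot's reply (second level)
    "children": ["Good bot!", "THE KING!"],     # users replying to the bot (third level)
}

def flatten(thread):
    """Return (level, text) pairs in reading order for one thread."""
    rows = [("parent", thread["parent"]), ("bot", thread["bot"])]
    rows += [("child", c) for c in thread["children"]]
    return rows

for level, text in flatten(thread):
    print(f"{level}: {text}")
```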


Who is Bobby-B?



Bobby-B is a bot built on a pool of 64 quotes. It replies to parent comments with a random quote from King Robert Baratheon, a character from Game of Thrones.
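In outline, a quote bot like this watches for a summon phrase and replies with a random line from its pool. A minimal sketch, where the trigger phrase and the two sample quotes are my assumptions (the real bot draws from 64 lines):

```python
import random

# Two stand-in quotes; the real bot has a pool of 64 Robert Baratheon lines.
QUOTES = [
    "Gods, I was strong then!",
    "Start the damn joust before I piss myself!",
]

# Assumed summon phrase for illustration; the real trigger may differ.
TRIGGER = "bobby b"

def maybe_reply(comment_body, rng):
    """Return a random quote when the comment mentions the trigger phrase."""
    if TRIGGER in comment_body.lower():
        return rng.choice(QUOTES)
    return None

print(maybe_reply("Come quick, BOBBY B!", random.Random(0)))
```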








So… I want to see the interaction between the bot and humans.




Data Preparation & Wrangling



I used the praw package to scrape comments from Reddit.

posts = []
replies_code = []
clean_parent = []
botnamecol = []

bot_names = list(bots["Bot Name"])
bot_names_small = bot_names[:11]   # first 11 bots only

for name in bot_names_small:
  try:
    for comment in reddit.redditor(name).comments.new(limit=1000):
        posts.append([comment.link_title, comment.link_url, comment.body,
                      comment.link_permalink, comment.ups, comment.downs,
                      comment.subreddit, comment.id, comment.link_id,
                      comment.parent_id, comment.score])
        replies_code.append(comment.id)
        botnamecol.append(name)
  except Exception:   # skip accounts that cannot be fetched (e.g. suspended)
    continue


posts = pd.DataFrame(posts, columns=['title', 'post_link', 'comment', 'link',
                                     'upvote', 'downvote', 'subreddit', 'id',
                                     'link_id', 'parent_id', 'comment karma'])

# Strip the 't3_' prefix from the submission ids
clean_sub = [[a[3:]] for a in posts['link_id']]
submission = pd.DataFrame(clean_sub, columns=['sub_id'])

newsubmission = []
for a in submission['sub_id']:
  selftext = reddit.submission(a).selftext   # fetch once instead of twice
  newsubmission.append(selftext if selftext else 'Submission without text')



# Strip the 't1_'/'t3_' prefix from the parent ids
for parent_id in posts['parent_id']:
  clean_parent.append([parent_id[3:]])

newparent = pd.DataFrame(clean_parent, columns=['clean_parent_id'])

newparentcomment = []
for a in newparent['clean_parent_id']:
  try:
    newparentcomment.append(reddit.comment(a).body)
  except Exception:   # parent is the submission itself, not a comment
    newparentcomment.append('Same as post')

posts['parent comment'] = newparentcomment
posts['bot name'] = botnamecol


replies3 = []
for code in replies_code:
  try:
    comment = reddit.comment(code)
    comment.reply_sort = 'new'
    comment.refresh()
    replies = comment.replies
    replies.replace_more(limit=None)   # expand all "load more comments" stubs
    if replies:
      replies3.append(' '.join(x.body for x in replies))
    else:
      replies3.append('No replies')
  except Exception:
    replies3.append('No replies because bot does not work')



posts['replies'] = replies3
posts['post_text'] = newsubmission
posts = posts[['bot name','title','post_text','post_link', 'parent comment','comment','replies','link','upvote','downvote','comment karma','subreddit','id','link_id','parent_id']]

posts.to_csv(r"/content/drive/Shared drives/Reddit_Bot_Project/top10_final_oresentation.csv")

library(rjson)
library(dplyr)
library(jsonlite)
library(tidytext)
library(tidyverse)
library(stringr)
library(stm)
library(textstem)
library(tm)
library(ggplot2)
library(DT)

bobby_comment <- fromJSON("bobby-b-bot.comments.json")
bobby_parent <- fromJSON("bobby-b-bot.parentcomments.json")
bobby_child <- fromJSON("bobby-b-bot.childcomments.json")

jockers <- lexicon::hash_sentiment_jockers ### I decided to use Jockers as my main lexicon reference

bobby_child$parent_id <- sub("t1_","",bobby_child$parent_id)  ### Remove id label

bobby_child <- bobby_child %>%
  mutate(child_text = text) %>%
  select(-post_id, -text)

bobby_child <- bobby_child %>%  ### Paste together all child comments that belong to the same bot comment
  group_by(parent_id)%>%
  summarise( replies = paste(child_text, collapse = " | ") )

bobby_comment_child <- bobby_comment %>%
  inner_join(bobby_child, by = c("comment_id" = "parent_id"))

bobby_comment_child <- bobby_comment_child %>%            ### Count how many child comments are under each bot comment
  mutate(repliescount = stringr::str_count(.$replies,'\\|'),
         repliescount = ifelse(repliescount>0,repliescount+1,1))

bobby_comment_child$parent_id <- sub("t1_","",bobby_comment_child$parent_id)

bobby_comment_child$sub_id <- sub("t3_","",bobby_comment_child$sub_id)

bobby_parent1 <- bobby_parent %>%
  mutate(parent_text = text) %>%
  select(-text)

bobby_comment_child2<- bobby_comment_child %>%
  left_join(bobby_parent1, by = c("parent_id" = "parent_id"))

bobby_comment_child3 <- bobby_comment_child2 %>%
  select(parent_text,text,score,replies,repliescount)


bobby_comment_child4 <- bobby_comment_child3 %>%    ### Clean the text data
  mutate(parent_text = as.character(parent_text),
         parent_text = str_replace_all(parent_text, "\n", " "),   
         parent_text = str_replace_all(parent_text, "(\\[.*?\\])", ""),
         parent_text = str_squish(parent_text), 
         parent_text = gsub("([a-z])([A-Z])", "\\1 \\2", parent_text), 
         parent_text = tolower(parent_text), 
         parent_text = removeWords(parent_text, c("’", stopwords(kind = "en"))), 
         parent_text = removePunctuation(parent_text), 
         parent_text = removeNumbers(parent_text),
         parent_text = textstem::lemmatize_strings(parent_text),
         text = as.character(text),
         text = str_replace_all(text, "\n", " "),   
         text = str_replace_all(text, "(\\[.*?\\])", ""),
         text = str_squish(text), 
         text = gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "",text),
         text = gsub("([a-z])([A-Z])", "\\1 \\2", text), 
         text = tolower(text), 
         text = removeWords(text, c("’", stopwords(kind = "en"))), 
         text = removePunctuation(text), 
         text = removeNumbers(text),
         text = textstem::lemmatize_strings(text),
         replies = as.character(replies),
         replies = str_replace_all(replies, "\n", " "),   
         replies = str_replace_all(replies, "(\\[.*?\\])", ""),
         replies = str_squish(replies), 
         replies = gsub("([a-z])([A-Z])", "\\1 \\2",replies), 
         replies = tolower(replies), 
         replies = removeWords(replies, c("’", stopwords(kind = "en"))), 
         replies = removePunctuation(replies), 
         replies = removeNumbers(replies),
         replies = textstem::lemmatize_strings(replies)) %>%
  as.data.frame()

bobby_comment_child4 <- bobby_comment_child4 %>%     ### Assign id for later lexicon join
  mutate(id = 1:n())


test1 <-  bobby_comment_child4 %>%  ### Inner join lexicon with parent comment
  unnest_tokens(word,parent_text) %>%
  inner_join(jockers, by = c("word" = "x"))

test1$y[is.na(test1$y)] <-  0

parent_text_t <- test1 %>%
  group_by(id)%>%
  summarise(parent_text_score = round(mean(y[y!=0]),2))


test2 <-  bobby_comment_child4 %>%  ### Inner join lexicon with bot comment
  unnest_tokens(word,text) %>%
  inner_join(jockers, by = c("word" = "x"))

test2$y[is.na(test2$y)] <-  0

text_t <- test2 %>%
  group_by(id)%>%
  summarise(text_score = round(mean(y[y!=0]),2))


test3 <-  bobby_comment_child4 %>%  ### Inner join lexicon with child comment
  unnest_tokens(word,replies) %>%
  inner_join(jockers, by = c("word" = "x"))

test3$y[is.na(test3$y)] <-  0

replies_t <- test3 %>%
  group_by(id)%>%
  summarise(replies_score = round(mean(y[y!=0]),2))

bobby_comment_child_score<- bobby_comment_child4%>%   ### Create the new table with scores
  inner_join(parent_text_t, by = c("id" = "id"))%>%
  inner_join(text_t, by = c("id" = "id"))%>%
  inner_join(replies_t, by = c("id" = "id"))

### Group by bot comment to filter out unqualified bot comments (e.g. the bot's failure message)
group_attempt <- bobby_comment_child_score %>%       
  group_by(text) %>%
  summarise(avg_replies_score= mean(replies_score),
            avg_parent_score= mean(parent_text_score),
            avg_text_score= mean(text_score),
            count = n()) %>%
  filter(count>1) %>%
  arrange(desc(avg_replies_score),desc(count))


bobby_comment_child_score2 <- bobby_comment_child_score %>%
  inner_join(group_attempt, by = c("text" = "text")) 

bobby_comment_child_score2 <- bobby_comment_child_score2 %>%   ### Reassign the id
  select(-id)%>%
  mutate(id = 1:n())

Data Exploration


Now, I am more interested in how many child comments belong to the same bot comment, because the sentiment score is affected by how many comments there are and how many words each comment contains.

As you can see from the graph, most bot comments get one to three replies, so I decided to focus on those with more than one child comment to make the sentiment analysis more reliable.
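The R wrangling above counts child comments by counting '|' separators in the joined replies string. The same filter can be sketched in Python (the reply strings here are hypothetical):

```python
def count_replies(joined_replies):
    """Mirror the R logic: count '|' separators in the joined replies string,
    then add one when there is at least one separator (else the count is 1)."""
    seps = joined_replies.count("|")
    return seps + 1 if seps > 0 else 1

# Hypothetical joined-reply strings, one per bot comment.
rows = ["good bot", "good bot | bad bot | ha"]
counts = [count_replies(r) for r in rows]          # [1, 3]
keep = [r for r, c in zip(rows, counts) if c > 1]  # bot comments with >1 reply
```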



After the distribution graph, I ran a linear regression to look at the relationship between parent-comment and child-comment sentiment. The p-value is highly significant, indicating a positive relationship (though the R-squared is small, so the correlation itself is weak), so I moved on to plot the graph.

## 
## Call:
## lm(formula = parent_text_score ~ replies_score, data = bobby_comment_child_score2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.11869 -0.42500  0.03416  0.44975  0.99081 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.063938   0.004818  13.270  < 2e-16 ***
## replies_score 0.054752   0.008729   6.272 3.67e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5401 on 12869 degrees of freedom
## Multiple R-squared:  0.003048,   Adjusted R-squared:  0.00297 
## F-statistic: 39.34 on 1 and 12869 DF,  p-value: 3.672e-10
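The lm call above regresses parent_text_score on replies_score. As a sanity check of what lm estimates, here is the closed-form simple-regression fit in Python on made-up scores (not the project's data):

```python
def ols(x, y):
    """Closed-form simple linear regression (what R's lm(y ~ x) estimates):
    slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up sentiment scores, not the project's data.
replies_score = [0.1, 0.2, 0.4, 0.5]
parent_score = [0.05, 0.10, 0.20, 0.25]
slope, intercept = ols(replies_score, parent_score)   # slope ~0.5, intercept ~0
```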

Recommendations & Conclusion


The Positive Bot Comment graph shows that when the bot's sentiment score is very positive, the child comments have a positive relationship with the parent comment. The Negative Bot Comment graph also indicates a positive correlation, with a slightly steeper slope.

Things become interesting if you look closer and compare the graphs: in the Negative Bot Comment graph, the regression line starts below 0, while in the Positive Bot Comment graph it starts above 0. The difference in starting points suggests that the bot comments actually function as a buffer between the parent comment and the child comments.

We could apply this knowledge to chatbot software, so that a bot can calculate the sentiment score of a complaint and respond with sentences of appropriate sentiment to calm the client down.




tf-idf


In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches in information retrieval, text mining, and user modeling. The tf–idf value increases proportionally with the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps adjust for the fact that some words appear more frequently in general.
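The definition above can be computed by hand. A minimal sketch on a toy corpus (one invented "document" per comment level), using the same idf form as the analysis below, log(N / n):

```python
import math
from collections import Counter

# Toy corpus: one tiny "document" per comment level (hypothetical text).
docs = {
    "parent": "bobby b bot say something".split(),
    "bot": "gods i be strong then".split(),
    "child": "good bot good bot".split(),
}

n_docs = len(docs)
tfidf = {}
for level, tokens in docs.items():
    total = len(tokens)
    for word, count in Counter(tokens).items():
        tf = count / total                          # frequency within this level
        df = sum(word in t for t in docs.values())  # levels containing the word
        idf = math.log(n_docs / df)                 # log(N / n)
        tfidf[(level, word)] = tf * idf

# 'good' is frequent in the child level and appears nowhere else, so it scores
# high there; 'bot' appears at two levels, so its idf (and tf-idf) is lower.
```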



The top 15 most important words at each level are listed below. It is interesting to see that the bot tends to use provocative, dramatic words that trigger users to respond. As for the child comments, sentiment-laden words rank highly, which suggests the bot comments feel very human-like.

### Only keep words with more than two characters
bobby_comment_child3$parent_text<-gsub('\\b\\w{1,2}\\b','',bobby_comment_child3$parent_text)   
bobby_comment_child3$text <- gsub('\\b\\w{1,2}\\b','',bobby_comment_child3$text)
bobby_comment_child3$text <- gsub("?(f|ht)tp(s?)://(.*)[.][a-z]+","",bobby_comment_child3$text)
bobby_comment_child3$replies <-gsub('\\b\\w{1,2}\\b','',bobby_comment_child3$replies)

parent <- paste(bobby_comment_child3$parent_text, collapse = ",")
bot_text <- paste(bobby_comment_child3$text ,collapse = ",")
child <- paste(bobby_comment_child3$replies,collapse  = ",")

text_data = data.frame(level = c("parent", "bot_text", "child"), 
                      text = c(tolower(parent), tolower(bot_text), 
                                 tolower(child)), 
                      stringsAsFactors = FALSE)




textTF = text_data %>% 
  split(., .$level) %>%
  lapply(., function(x) {
    textTokens = tm::MC_tokenizer(x$text)
    tokenCount = as.data.frame(summary(as.factor(textTokens), maxsum = 1000))
    total = length(textTokens)
    tokenCount = data.frame(count = tokenCount[[1]], 
                            word = row.names(tokenCount),
                            total = total,
                            level = x$level,
                            row.names = NULL)
    return(tokenCount)
  }) 

textTF = do.call("rbind", textTF)  

textTF$tf = textTF$count/textTF$total


### idf

idfDF = textTF %>% 
  group_by(word) %>% 
  count() %>% 
  mutate(idf = log((length(unique(textTF$level)) / n)))


### tf-idf 

tfidfData = merge(textTF, idfDF, by = "word")
tfidfData$tfIDF = tfidfData$tf * tfidfData$idf

### top 15

tfidfData %>% 
  group_by(level) %>% 
  arrange(level, desc(tfIDF)) %>% 
  slice(1:15) %>% 
  rmarkdown::paged_table()




N-grams


N-grams of texts are extensively used in text mining and natural language processing tasks. They are basically sets of co-occurring words within a given window; when computing the n-grams you typically move one word forward at a time. In short, it is a fun way to see how the words are structured in the bot comments. It seems like Bob needs mooooooooar wine!
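The sliding-window idea can be sketched in a few lines (the sample line is made up):

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token list, one word at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "get him a cup of wine".split()  # a made-up Bobby-B-flavoured line
bigrams = ngrams(tokens, 2)
# ('get', 'him'), ('him', 'a'), ('a', 'cup'), ('cup', 'of'), ('of', 'wine')
```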




Topic Models


Topic modeling is an unsupervised machine learning technique that’s capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of documents.

The graph below shows how we decide how many topics (k) are in the documents. The four plots help us determine the best number of topics. I focus on semantic coherence (how well the words hang together, computed from a conditional probability score over frequent words) and the residuals (the difference between the observed and predicted values of the dependent variable). We want low residuals and high semantic coherence. The residuals take a sharp dive as we increase k. I decided to use k = 6, indicating that there are 6 topics within the bot comment section.

I found that this website introduces topic models pretty well!

## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Creating Output...



The following shows the proportion of each topic, along with a few of its highest-probability words.



I generated 6 plots with the child-comment sentiment score as the x-axis to see how people react to different topics in the bot comments.

The first two graphs show a positive relationship between the topic and the sentiment score. Words like good, warm, and honor do seem right to increase the sentiment score as the topic takes a larger proportion. Surprisingly, we can see that pregnant, breed, and child also help increase the sentiment score.

## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Creating Output...



Compared with the others, these two show a negative relationship. When words like stupid, pay, and fear come up more often, it is reasonable to assume that they decrease the sentiment score. It is also funny to see that when Ned Stark is mentioned multiple times, the score goes down.



The following graphs are the most interesting ones. When the seven kingdoms and kill are mentioned, people tend to dislike it. On the right-hand side, Bessie is Robert Baratheon's mistress, and when she is mentioned multiple times, people react in a very positive way.




Data Set


The following table is the data set and you can play around with it!




Thank You

 

A work by Michael Ma

mma4@nd.edu